Non-metric similarity 2024-12-18 - NISHIO Hirokazu's Scrapbox (Auto-translated from Japanese)

When several detailed topics are extracted and pages are compared with each other based on combinations of those topics, a "semantically satisfactory" connection can be defined more robustly. This is an attempt to measure semantic closeness at a higher level than mere overlap.

Improve accuracy by combining compound keywords and concepts:

By introducing a compound condition such as "X+Y," a narrower but semantically distinct context can be defined, and page-to-page links within that context can be emphasized.
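A minimal sketch of the "X+Y" idea, assuming each page has already been reduced to a set of extracted keywords. The page names, keywords, and both helper functions are hypothetical and only illustrate how a compound condition narrows the set of linked pages.

```python
from itertools import combinations

# Hypothetical toy data: page title -> set of extracted keywords.
pages = {
    "page_a": {"similarity", "graph", "keyword"},
    "page_b": {"similarity", "graph", "topic"},
    "page_c": {"similarity", "embedding"},
    "page_d": {"graph", "keyword"},
}


def pages_matching(pages, required):
    """Return titles of pages whose keyword set contains every required keyword."""
    return sorted(title for title, kws in pages.items() if required <= kws)


def compound_links(pages, required):
    """Link every pair of pages that both satisfy the compound condition."""
    return list(combinations(pages_matching(pages, required), 2))


# A single keyword links many pages; the compound "similarity+graph" links fewer.
single = compound_links(pages, {"similarity"})
compound = compound_links(pages, {"similarity", "graph"})
print(single)
print(compound)
```

With this toy data the single keyword yields three links, while the compound condition keeps only the pair that shares both keywords.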

These are all important points to consider when dealing with complex semantic spaces that do not fit into a mere similarity/dissimilarity dichotomy. Your thinking has evolved from simple similarity measures to representations in latent dimensions and linking strategies with complex conditioning, and such a multifaceted approach would be very useful in actual knowledge discovery and information seeking tasks.

---

I like the key phrase "semantically satisfactory connection."

---

o1 Pro.icon

To organize 10,000 short sentences without relying on a "distance space," the following approaches are possible.

Conceptual organization by topic model (e.g., LDA):

This represents each text as a distribution over latent topics (a probability distribution) rather than as a point in a distance space. Each document gets an interpretable description of which topics it contains and in what proportion, so documents can be understood through topic similarity and relevance without falling back on a distance space.
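The sketch below assumes a topic model such as LDA has already been fitted, and shows one way to read the resulting distributions directly instead of computing distances. The distributions are hand-written hypothetical values, and the threshold of 0.3 is an arbitrary assumption.

```python
from collections import defaultdict

# Hypothetical topic distributions (each row sums to 1), standing in for
# the output of a fitted topic model such as LDA.
doc_topics = {
    "doc1": [0.80, 0.15, 0.05],
    "doc2": [0.10, 0.70, 0.20],
    "doc3": [0.45, 0.50, 0.05],
}


def topics_of(dist, threshold=0.3):
    """Indices of topics whose share in the document is at least threshold."""
    return [i for i, p in enumerate(dist) if p >= threshold]


def group_by_topic(doc_topics, threshold=0.3):
    """Map each topic index to the documents that contain it substantially."""
    groups = defaultdict(list)
    for doc, dist in doc_topics.items():
        for topic in topics_of(dist, threshold):
            groups[topic].append(doc)
    return dict(groups)


groups = group_by_topic(doc_topics)
print(groups)
```

A document such as `doc3` belongs to two topic groups at once, which a single nearest-neighbor distance would not express.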

Hierarchical Structuring by Feature Set and Concept Lattice (Formal Concept Analysis):

Each short document is treated as a set of features (vocabulary, labels, tags, etc.) and a concept lattice is constructed from them. This is a hierarchical organization based on inclusion relations rather than distances, so relationships between documents can be visualized and understood in a form that does not rely on distance.
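As a rough illustration, the following sketch enumerates the formal concepts of a tiny context by brute force. The documents and features are hypothetical, and closing every subset of documents is exponential, so this is only meant to show what a concept (a set of documents paired with the features they all share) looks like.

```python
from itertools import combinations

# Hypothetical formal context: document -> set of features (tags).
context = {
    "d1": {"nlp", "graph"},
    "d2": {"nlp", "topic"},
    "d3": {"nlp", "graph", "topic"},
}
all_features = set().union(*context.values())


def common_features(docs):
    """Features shared by every document in docs (all features if docs is empty)."""
    shared = set(all_features)
    for d in docs:
        shared &= context[d]
    return shared


def docs_having(features):
    """Documents that contain every feature in features."""
    return {d for d, fs in context.items() if features <= fs}


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    names = sorted(context)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            intent = common_features(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

The hierarchy comes from inclusion: the concept for `{"nlp"}` covers all three documents, and the more specific concepts cover subsets of them.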

Network representation (co-occurrence graph):

Co-occurrence relationships between words and phrases in the short sentences are represented as a network (graph). Documents are understood as part of the link structure between nodes, and topical cohesion can be visualized and analyzed independently of distance, for example by community detection.
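A minimal sketch of the network view, assuming the sentences are already tokenized. The data is hypothetical, and connected components are used here as a much cruder stand-in for a real community detection algorithm.

```python
from itertools import combinations

# Hypothetical short sentences, already tokenized.
sentences = [
    ["topic", "model", "lda"],
    ["lda", "distribution"],
    ["concept", "lattice"],
    ["lattice", "hierarchy"],
]


def cooccurrence_graph(sentences):
    """Undirected graph: words are nodes, an edge joins words sharing a sentence."""
    graph = {}
    for words in sentences:
        for w in words:
            graph.setdefault(w, set())  # isolated words still become nodes
        for a, b in combinations(set(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes reachable from one another (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


components = connected_components(cooccurrence_graph(sentences))
print([sorted(c) for c in components])
```

Here the word "lda" bridges the first two sentences into one group, while the lattice-related sentences form another.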

nishio.icon

If you simply use words as they are, particles such as "ha" (は) co-occur all the time. Therefore, high-frequency words are treated as stopwords.

On the other hand, low-frequency words may actually be useful, but they fail to connect because of notational variation.

A system for selecting keywords with appropriate granularity is needed.
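One simple way to approximate "appropriate granularity" is to keep only words inside a document-frequency band. The sketch below is an assumption, not the author's method: the thresholds `min_df` and `max_df_ratio` are hypothetical, and it does nothing about notational variants, which would need separate normalization.

```python
from collections import Counter

# Hypothetical tokenized short sentences.
sentences = [
    ["the", "topic", "model"],
    ["the", "topic", "graph"],
    ["the", "concept", "lattice"],
    ["the", "graph", "lattice"],
]


def select_keywords(sentences, min_df=2, max_df_ratio=0.75):
    """Keep words whose document frequency is neither too low nor too high."""
    # Count each word at most once per sentence (document frequency).
    df = Counter(w for words in sentences for w in set(words))
    max_df = max_df_ratio * len(sentences)
    return sorted(w for w, n in df.items() if min_df <= n <= max_df)


keywords = select_keywords(sentences)
print(keywords)
```

The word that appears in every sentence is dropped as a stopword, and words that appear only once are dropped because they cannot connect anything.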

Classification by rule-based tagging:

Specific keywords or patterns are extracted from the text and categorized based on rules and dictionaries. This is a method of organizing based not on "distance proximity" but on the question "Which category condition does this sentence satisfy?"
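A minimal sketch of rule-based tagging with regular expressions. The category names and patterns in `RULES` are hypothetical; a real system would use a curated dictionary.

```python
import re

# Hypothetical rules: category name -> regular expression that must match.
RULES = {
    "question": re.compile(r"\?\s*$"),
    "topic_model": re.compile(r"\b(LDA|topic model)\b", re.IGNORECASE),
    "graph": re.compile(r"\b(graph|network)\b", re.IGNORECASE),
}


def tag(sentence):
    """Return every category whose rule the sentence satisfies."""
    return [name for name, pattern in RULES.items() if pattern.search(sentence)]


print(tag("Is LDA a topic model?"))
print(tag("A co-occurrence network of words"))
print(tag("Nothing matches here"))
```

A sentence can satisfy several rules, one rule, or none, so the result is a set of category memberships rather than a position in a space.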

These methods do not assume distance calculations, but allow us to organize and understand 10,000 short texts on different axes, such as probability distributions, hierarchical structures, network relations, rules and conceptual attributes.

This means "choose the one with overlap > 0".

the difference between looking from the top or from the middle.

relevance

dwell-think

---

This page is auto-translated from /nishio/非計量類似度2024-12-18 using DeepL. If something looks interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.